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Abstract 

Statistical models that possess symmetry arise in diverse settings such as ran- 
dom fields associated to geophysical phenomena, exchangeable processes in Bayesian 
statistics, and cyclostationary processes in engineering. We formalize the notion of 
a symmetric model via group invariance. We propose projection onto a group fixed 
point subspace as a fundamental way of regularizing covariance matrices in the high- 
dimensional regime. In terms of parameters associated to the group we derive precise 
rates of convergence of the regularized covariance matrix and demonstrate that signif- 
icant statistical gains may be expected in terms of the sample complexity. We further 
explore the consequences of symmetry on related model-selection problems such as the 
learning of sparse covariance and inverse covariance matrices. We also verify our results 
with simulations. 

1 Introduction 

An important feature of many modern data analysis problems is the small number of sam- 
ples available relative to the dimension of the data. Such high- dimensional settings arise in 
a range of applications in bioinformatics, climate studies, and economics. A fundamental 
problem that arises in the high-dimensional regime is the poor conditioning of sample statis- 
tics such as sample covariance matrices [9] ,[8]. Accordingly, a fruitful and active research 
agenda over the last few years has been the development of methods for high-dimensional 
statistical inference and modeling that take into account structure in the underlying model. 
Some examples of structural assumptions on statistical models include models with a few 
latent factors (leading to low-rank covariance matrices) [18], models specified by banded or 
sparse covariance matrices [9], [8], and Markov or graphical models [25, 31, 27]. 

The focus of this paper is on exploiting group symmetries in covariance matrices. Models 
in which the distribution of a collection of random variables is invariant under certain per- 
mutations of the variables have been explored in many diverse areas of statistical research 
[22]. Symmetry assumptions have played a prominent role within the context of covariance 
matrices and multivariate Gaussians [1] and these assumptions are of interest in numerous 
applications (see [22] for a detailed list). 

We systematically investigate the statistical and computational benefits of exploiting 
symmetry in high-dimensional covariance estimation. As a very simple example consider 



the estimation of the variances of p independent random variables from n samples of the 
variables. Since the variables are independent, this problem is equivalent to estimating the 
variances of each variable separately from n samples. Suppose we are additionally given that 
the variables are identically distributed - we now need to estimate just a single parameter 
from what are effectively n x p samples. This very elementary example demonstrates the 
potential for improvements in the sample complexity as well as computational complexity in 
models with symmetries. We generalize this basic insight to much more complicated settings 
in which the distribution underlying a collection of random variables may be symmetric with 
respect to an arbitrary subgroup of the symmetric group. More formally, we investigate 
multivariate Gaussians specified by invariant covariance matrices: 

e = n.snj vn 9 e 0, 

where is some subgroup of the symmetric group, i.e., the group of all permutations. 
Associated to each subgroup of the symmetric group is the fixed-point subspace W© of 
matrices that is invariant with respect to conjugation by each element of 0. This subspace 
plays a fundamental role via Schur's lemma from representation theory in our analysis of 
the benefits in sample complexity and computational complexity of regularizing invariant 
covariance matrices. 

The general advantages of symmetry exploitation are numerous 

• Problem size: One advantage is that when symmetry is incorporated in the model, the 
problem size often reduces, and the new instance can have significantly fewer number of 
variables and constraints, which can lead to dramatic computational speed-ups. This 
is exploited for example in finite-element methods for partial differential equations [19] . 

• Better statistical properties: As we will see in this paper, exploiting symmetry also 
leads to statistical gains in terms of order-of-magnitude gains in the sample complexity 
of model selection. 

• Numerical benefits: Another advantage is the removal of degeneracies in the problem 
that arises from the high multiplicity of eigenvalues that is typically associated to 
symmetric models. Such multiplicities are a common source of difficulties in numerical 
methods, and they can be properly handled by suitably exploiting symmetry in the 
problem. Symmetry-aware methods in general have better numerical conditioning, are 
more reliable and lead to faster and more accurate solutions. 

This paper consists of three main technical contributions. First, we precisely quantify 
in terms of properties of the improvement in rates of convergence when the sample co- 
variance matrix (obtained from n samples) is projected onto Wg. Our analysis holds for 
the spectral or operator norm, the Frobenius norm, and the norm (maximum absolute 
entry). Second, we study the implications of these improved rates of convergence by spe- 
cializing recent results on covariance estimation of Bickel and Levina [8] , and of Ravikumar 
et al. [31] to our setting with invariant covariance matrices. These results quantitatively 
demonstrate the benefits of taking symmetry structure into account in covariance estimation 
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tasks. Finally, we discuss the computational benefits of incorporating symmetry in covari- 
ance regularization. Specifically, we describe how large-dimensional convex programs based 
on regularized maximum-likelihood can be simplified significantly to smaller-sized convex 
programs by incorporating prior information about symmetry. 

Symmetry and its consequences in the context of statistics, and more broadly in compu- 
tational mathematics, has been well studied. We mention some related work here. In the 
Gaussian setting, work has been done related to testing of symmetry in statistical models 
[39, 30, 37]. In the context of Markov random fields symmetry restrictions have been studied 
in [38, 6, 7, 23, 2, 26]. We also mention the work of Hojsgaard and Lauritzen [22], who study 
edge and vertex symmetries for covariance and inverse covariance (graphical models) and 
exploitation of symmetry for inference problems. They also cite interesting examples from 
the social sciences (such as distribution of test scores among students and results of psycho- 
logical tests) where symmetric structure has been experimentally observed. More broadly, 
symmetry has played an important role in other computational problems such as optimiza- 
tion [36, 21, 16, 24], in the computation of kissing numbers and sphere-packing problems [4], 
in coding theory [33], in truss topology optimization [5], and in the design of rapidly mixing 
Markov chains [10]. 

In the remainder of Section 1 we explain a few specific applications where exploiting 
symmetry in statistical models is of interest. We also set up some of the notation that will 
be used throughout the paper. In Section 2 we introduce the necessary group theoretic pre- 
liminaries, the notion of (3-invariant covariance matrices, and related statistical quantities. 
In Section 3 we state some of our main results that quantify the statistical gains when sym- 
metry is exploited judiciously. In Section 4 we extend our results from Section 3 to selecting 
sparse covariance and inverse covariance models. In Section 5, we make some brief remarks 
regarding how computational benefits can be gained when selecting symmetric models. In 
Section 6, we present some simulation results that clarify the statistical gains one may expect 
when symmetry is exploited. In Section 7, we state some concluding remarks. 

1.1 Applications 

In this section we motivate the problem under consideration by briefly describing a few 
applications in which estimation of symmetric covariance structure arises naturally. 

1.1.1 Random Fields in Physical Phenomena 

Many signals that capture the spatial variability of natural phenomena are described in the 
framework of random fields [28] . Random fields arise in a variety of engineering contexts, for 
example the temperature distribution of materials, the circulation of ocean fields in oceanog- 
raphy, or the transport of groundwater in hydrology [28]. A classical model is associated 
with the well-known Poisson's equation: 

V 2 (f)(x) = f(x) xeR d (1) 
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where V 2 is the Laplace operator. In hydrology, this equation governs the steady state flow 
of a fluid in M 3 , whereas in electrostatics it governs the distribution of the potential in a 
variety of settings such as transmission lines [28]. Under the right technical conditions, if 
f(x) is a white noise process, the stochastic process <p(x) in (1) is a Gauss-Markov random 
field [32]. While Poisson's equation (1) is a sample-path representation of the process 4>(x), 
a particularly useful representation of this process is the covariance R(xi,X2) associated to 
this random field. Indeed the covariance R(xi, x?) is the Green's function of the biharmonic 
equation, and its properties have been studied in detail by Moura and Goswami [28]. 

A useful observation that is pertinent in our context is that the Laplacian as well as the 
biharmonic equation are invariant under the action of the orthogonal group, i.e. they are 
isotropic. If in addition the forcing process f(x) has isotropic stochastic properties (one such 
example being a white noise process) and the boundary conditions of the physical process 
are also isotropic, then the covariance R(xi,X2) will also be isotropic. 

When such random fields are encountered in practice, the usual engineering approach 
is to discretize the domain so as to make the problem computationally tractable. The 
restriction of the covariance R{x\,X2) to the finite discretization of the domain results is 
a covariance matrix R. Recognizing the symmetry in the problem, it is often useful to 
construct a symmetry-preserving discretization of the space [29] . Such a modeling paradigm 
naturally leads to a covariance matrix that is invariant with respect to a discrete group (3 
which is a finite subgroup of the orthogonal group. A motivating question in this paper then 
is: 

"Given samples of the stochastic process <p(x) on this discretized domain, can one exploit the 
symmetry of the space to obtain a good statistical approximation of the covariance matrix 
RT 

1.1.2 Bayesian Models 

A fundamental notion in Bayesian statistics is that of exchangeability. Indeed, in such 
settings one often deals with a collection of random variables A 1; . . . ,X n where it is not 
appropriate to assume that this collection is i.i.d.. It is far more natural to assume instead 
that the collection of random variables is exchangeable or partially exchangeable. We give 
two such examples: 

1. Urn models: Consider an urn with n r red balls and n& blue balls. A quantity T 
balls are randomly drawn without replacement from the urn. Let us define the random 
variable: 

J if the i th ball is red 
i = \ 1 if the i th ball is blue. 

Clearly the random variables X\, . . . ,X? are not i.i.d. On the other hand, it is well- 
known that this sequence of random variables is exchangeable, i.e. for any permutation 
a on the indices 1, . . . , T and for any set A 

P ((*!, ...,x T )eA) = F ((X CT(1) , . . . , x a{T) ) e A) . 
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In the terminology developed in this paper, this distribution is invariant with respect 
to the group <8 corresponding to the symmetric group. Such urn models are commonly 
used metaphors for problems related to computing statistical properties of a population 
of objects. 

Suppose the number of red and blue balls in the urn is unknown, and one wants to 
estimate the probability of drawing two consecutive blue balls in the first two draws 
empirically from multiple trials of T draws each. It is an easy exercise to check that: 

E[X 1 X 2 ]=F(X 1 = 1,X 2 = 1). 

By exchangeability E [X1X2] = E [XjXj] for any i,j e {1, . . . , T}. Hence the computa- 
tion of the required probability reduces to estimating the empirical covariance matrix. 
It is important to exploit the underlying symmetry in this computation. 

2. Clinical tests: Suppose a clinical test is conducted on a set of T patients wherein 
a drug is administered to the patients and a physiological response x\, x 2 , ■ ■ ■ , is 
measured for each patient. The set of patients are divided into groups Ti, . . . ,Th 
of patients of similar physiological properties (say based on characteristics such as 
combinations of similar age, ethnicity, sex, and body size). It is often reasonable to 
assume that the distributions of the responses Xj of patients within the same class are 
exchangeable (but not i.i.d.). Sources of non-independence include latent factors such 
as residual effects of a previous experiment, physical effects such as the temperature, 
etc. Indeed from a Bayesian perspective it is more natural to assume that the responses 
are exchangeable as opposed to i.i.d.. Thus, the distribution of (Xi, . . . ,X n ) may be 
assumed to be invariant with respect to permutation of patient labels within the same 
class Tj. Specifically, if Sym fc denotes the symmetric group of permutations on k 
indices then the the distribution of (X 1; . . . ,X n ) is invariant under the action of the 
group <5 = Syni| Ti | x . . . x Syru^. In such clinical tests, it is of importance to estimate 
the covariance matrix of the responses from a limited number of trials. 

1.1.3 Cyclostationary processes 

An interesting special case of random phenomena with group symmetries are the class of 
well-studied stochastic processes known as cyclostationary processes [20]. These processes 
arise in random phenomena that obey periodic stochastic properties, for example the yearly 
variation of temperature in a particular location. Such processes are frequently encountered 
in engineering and arise in a variety of contexts such as communication systems (presence of a 
sinusoidal component), radar (rotating elements), and the study of vibrations in mechanical 
engineering. Moreover, a special case of such processes are the classical stationary processes, 
which are ubiquitous in engineering. 

Formally, a discrete time stochastic process {X t } tgZ is said to be cyclostationary with 
period N if for any measurable set A 

P ((X tl , . . . XJ e A) = P (X tl+JV , . . . X tk+N ) e A) for all h,...,t k eZ. 



5 



Thus the joint probability distribution of such a process is invariant under shifts of N time 
units. In the particular instance when N = 1 we obtain a stationary process. An important 
statistical property associated to this process is the auto- correlation 

R(t, r) := E [XtXr] ■ 

If we restrict attention to the first kN time units for some integers k, it is clear that the 
restriction of the autocorrelation admits a finite representation as a matrix R e ^ kNxkN _ T n 
the stationary case (i.e. N — 1), from shift invariance it is also apparent that R(t, r) depends 
only on \t — r|, so that the resulting covariance matrix is Toeplitz. A commonplace and 
particularly useful approximation for large Toeplitz matrices is to convert them to circulant 
matrices [35]. In the more general cyclostationary case (the period N > 1), the matrix R is 
block Toeplitz and approximated via a block circulant one where each block is of size N x N. 

In group-theoretic terms, the covariance matrix R is invariant under the action of the 
cyclic group of order k (the group captures cyclic translations of the indices). This group 
is abelian, and its action on the index set is precisely the cyclic shift operator by iV units. 
As is well-known from the representation theory of the cyclic group, the Fourier transform 
(block) diagonalizes circulant matrices, and the resulting (block) diagonal matrix is called 
the spectrum of the autocorrelation. These observations are classical results and we mention 
them here to point out an interesting special case of our framework. 

1.2 Notation 

We describe some of the notation that will be adopted through the rest of this paper. We 
will commonly deal with a random vector X G 1R P , where p is the dimension of the space. 
Occasionally, we will encounter the complex vector space C p . We will be interested in 
inferring statistical properties of this random vector from i.i.d. samples X\, . . . ,X n where 
n is the number of samples. We will study the problem in a high- dimensional setting, i.e. 
where p>nso that the dimension of the space is much larger than the number of samples. 

We will mostly consider normally distributed random vectors X with zero mean and 
covariance S, we will denote this by jV(0, E). The inverse of the covariance is called the 
concentration matrix and we use the notation := E _1 . The probability of an event A will 
be denoted by F(A) and the expectation of a random variable Y is denoted by K[Y}. The 
indicator function of an event A is denoted by 1{A). If h(X) is a real valued function of a 
Revalued random vector, we will use the notation that h(X) = Op(f(p,ri)) to mean that 
h(X) < Cf(p,n) with probability exceeding 1 — ^ for some constants C and r > 0. 

Given a finite group (J5 which is a subgroup of the symmetric group, we consider the group 
action of (3 on the index set {1, . . . ,p}. Specifically, an element g 6 <S can be interpreted 
as a permutation, and g acts on {1, . . . ,p} by permuting these indices. To every g e one 
can associate a permutation matrix U g 6 MP. 

We will denote the positive-definiteness (respectively positive semidefiniteness) of a sym- 
metric matrix M by M y (M y 0). The space of symmetric positive definite matrices in 
M. pxp will be denoted by <S++. The maximum eigenvalue of a symmetric matrix M is denoted 
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by A max (M). Its trace (sum of the diagonal entries) will be denoted by Tr(M). We will 
encounter different norms on the space of symmetric matrices: 

1. Frobenius norm: Denoted by \\M\\ F = (Tr(M T M)) 5 . 

2. Spectral norm: Denoted by ||M|| is the maximum singular value of M. 

3. Elementwise norm: Denoted by UMH^, = maxjj |My|. 

4. The operator oo norm: Denoted by HMH^^ = max« J2j 

5. Off-diagonal l\ norm: Denoted by ||M|| 1]0ff = |My|- 

2 Symmetric Models 

2.1 Symmetry, Groups, and Group Representations 

The central concepts that formalize the notion of symmetry mathematically are groups and 
invariance. Abstractly, a physical object or process (in our case a covariance matrix) is 
said to possess symmetry if it is invariant under the action of a group. For background on 
groups and group actions we refer the reader to any standard algebra textbooks, for example 
[17, 3, 34]. In this paper we will be dealing with finite groups (5, i.e. groups whose cardinality 
\<5\ is finite. A basic result in group theory, Cayley's theorem, establishes that every finite 
group is isomorphic to a subgroup of the symmetric group (i.e. the group of permutations 
with the group operation being permutation composition). Consequently, elements of finite 
groups (3 can be conveniently represented via permutation matrices, and in our context these 
naturally induce an action of (3 on matrices M G M. pxp via conjugation. 

Specifically, letting U g denote the permutation matrix corresponding to g G <5, a group 
action is a map tp: 

<p : (5 x R pxp -> W xp 

(g,M)^U g MUl 

A matrix M is said to be invariant with respect to the group if M = n g MlTj for all g G (5. 
The set 

w® = {x g w xp | npi:n 9 = x vn s G 0} 

is called the fixed point subspace of the group (3 and it is the set of all matrices that are 
invariant with respect to the group. Informally, this is the set of matrices that possess 
"symmetry" with respect to the group. We let Po(-) : M pxp —> denote the Euclidean 
projection operator that projects a matrix onto the fixed point subspace. 

A fundamental tool in the study of invariance is the theory of group representations. We 
provide a brief review of concepts from group representation theory which are relevant for 
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our discussion. Let GL(C N ) denote the general linear group of invertible matrices in C NxN . 
A group representation is a group homomorphism 

p :0 GL(C N ) 
9 ^ p(#)- 

The representation naturally induces an action of (J5 onto the vector space via 

p:(3xC N ^C N 
(g,v) ^ p(g)v. 

Thus, the representation naturally associates to each g G (55 a linear transformation p(g) that 
acts on C . Since a linear transformation is characterized by its invariant subspaces, it is suf- 
ficient to study the invariant subspaces of the linear transformations p(g) for g G (55. Indeed, 
if p is a given representation and W C C w is a subspace, we say that is & -invariant with 
respect to p if p((?)W / C W for every g G 25. We say that a representation p is irreducible if 
the only (55-invariant subspaces are the trivial subspaces and C N . Schur's lemma, a classical 
result in group representation theory, establishes that finite groups have only finitely many 
irreducible representations, and provides the following orthogonal decomposition theorem of 
in terms of the (55-invariant subspaces. 

Lemma 2.1 (Schur's lemma [34]). For a finite group (55 there are only finitely many in- 
equivalent irreducible representations (indexed by X) . . . , -&\ X \ of dimensions si, . . . , s\%\- 
The dimensions Sj divide the group order |(55| ; and satisfy ^ igI s • = Every linear 

representation of & has a canonical decomposition 

p = mii?i © ... © m|x|#|x|, 

where the rrii are the multiplicities. Correspondingly , there is an isotypic decomposition of 
<£ N into invariant subspaces Wi - . 

C N = Wi®...® W\ X \ 

where each Wi is again a direct sum of isomorphic copies Wi = Wn © ... © Wi )Tni with 
multiplicity m{ . 1 

A basis of this decomposition (that depends only on the group) transforming with respect 
to the matrices $(g) is called symmetry adapted, and can be explicitly computed algorith- 
mically [34, 19]. This basis defines a change of coordinates by a unitary matrix T G C NxN . 
One of the main consequences of Schur's lemma is that the symmetry adapted basis block 

1 Schur's lemma, as stated classically, provides a decomposition over the complex field C . However, 
a real version can be adapted in a straightforward manner [34, pp. 106-109], [21] and the irreducible real 
representations are called absolutely irreducible. 
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diagonalizes the fixed point subspace, i.e. every matrix in the fixed point subspace is com- 
monly diagonalized by T. If M G W© is a matrix in the fixed point subspace, then changing 
coordinates with respect to T decompses M into a block diagonal form as follows: 



T*MT = 



M 1 











M, 



M = 



Bi 

a 



(2) 



In the above decomposition the diagonal blocks M,- L G IR m i s * xm ^ can be further decomposed 
into m, repeated diagonal blocks £?j G ]R SiXSi (recall that Sj are the sizes of the irreducible 
representations and raj are the multiplicities). Thus, the symmetry restriction M G 
reduces the degrees of freedom in the problem of interest. This observation plays a central 
role in our paper. 

Example 2.1. To fix ideas, we will consider the following three examples to illustrate our 
results throughout the paper. 

1. Cyclic group: The cyclic group of order p is isomorhpic to 7Ljp7L (the group operation 
is the addition operation +). Let M G W xp be a circulant matrix of the form: 



M = 



Mi M 2 
M p Mi 



M p 
M v _ x 



M M>, ... M x 



(3) 



If p(k) is a permutation matrix corresponding to cyclic shifts of size k of the indices 
{1, . . . ,p}, then it is clear that p(k)Mp(k) T = M . Indeed the fixed point subspace is 
precisely the set of circulant matrices and since the group is abelian, can be described 
as 

W© = {M G R pxp | Mp(l) = p(l)M} . 

(Since 1 is the generator of the cyclic group, invariance with respect to 1 implies in- 
variance with respect to the other elements). 

It is well known that if T = T G C pxp is the Fourier transform matrix, then T*MT 
is block diagonal. (The columns of T form the symmetry adapted basis). Here \T\ = p 
(there are p irreducible representations) , mi — 1 and Sj = 1 for all i G X. In our 
setting, covariance matrices of such structure arise from cy do stationary processes. 

2. Symmetric group: Let us now consider <3 = Sym p , the full symmetric group of 
permutations on p indices. Let M G W xp be have the following structure: 



M 



a b 
b a 



(4) 



b b 
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It is straightforward to check that Tl g M = MH g for every permutation matrix H g , 
matrices satisfying this property constitute the fixed point subspace. It is well known 
that via the permutation representation there is an orthogonal matrix 2 T such that: 



T*MT = M 







Cl 


. 


. 


M 







c 2 • 


. 









. 


• c 2 



Here |Z| = 2, si = l,mi = 1, s 2 = l,m 2 = p — 1. In our setting, covariance matrices 
with this structure arise when we have a completely exchangeable sequence of random 
variables. 



3. Product of symmetric groups: Let T%, . . . ,Tf- be finite index sets where |T*| 



r, . 



We consider the product of symmetric groups Sym^ x . . . x Sym\ Tk \ acting on the index 
set T\ x . . . x Tfc . Consider a block k x k matrix of the form: 



M 



Mil M i2 
M 22 



M 2k 



kk 



(5) 



where for each j G {!,...,&}, Mjj G IR rj Xrj is of the form (4), and each My /or i ^ j is 



a constant multiple of the all ones matrix. Let a = (o"i, . . . , crjt) G Sym^ 



Sym^ 



be a permutation with this product structure. Let I\. a be its corresponding permutation 
matrix. Then it is clear that for a matrix of the form described above, Ii a M = MIi a , 
and the set of matrices satisfying this relation constitute the corresponding fixed point 
subspace Wg. Using standard representation theory results, it is possible to show that 
there is an orthogonal matrix T such that 



T*MT 



C 
cil ri _i 
















Cfci, 



where C G IR and I r G IR rxr is the identity matrix of size r. (In fact, upto a 
permutation of the columns, T can be chosen to be block diagonal where each diagonal 
block is of the form described in the preceding example). In the above example, \I\ — 
k + 1, Si — k, mi = 1. For i G {2, . . . , k + 1} ; Sj = 1, rrii = r i _ 1 — 1. Covariance 
matrices with product symmetric group structure arise when one has a set of random 
variables that are partially exchangeable. 



2 In fact any orthogonal matrix T whose first column is a multiple of the all ones vector, with the 
remaining columns span the orthogonal subspace will achieve the desired diagonalization. 
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Covariance matrices of interest in this paper will be invariant with respect to a group (3, 
and hence belong to the fixed point subspace W©. A fundamental operation in our paper 
will be the Euclidean projection of a given matrix M onto W©. The next lemma relates 
projection to any important notion called Reynolds averaging. Reynolds averaging operates 
by generating the orbit of M with respect to the group, i.e. the collection of matrices 
ILjMnJ , and averaging with respect to the orbit. 

Lemma 2.2. Let P©(-) be the Euclidean projection operator onto the fixed point subspace 
W©. Then 



T 
9 

gee 



Proof. Since W© is a subspace, it suffices to show that Tr((M — V@(M)) V®(M)) = (by 
the orthogonality principle and that fact that (A,B) = Tr(AB) is the inner product that 
generates the Euclidean norm over the space of symmetric matrices). We obtain the proof 
via the following chain of equalities: 

Tr(7>©(M)P e (M)) =Tr ( ±- £ n,Mn^ £ U g MuA 

V ' ' he® ' ' gee / 

= ^E Tr ( n ^E n ^ Mn n 

1 1 h&6 \ gee / 

= ^E Tr ( M E n ^ Mn ^ 

1 1 he® \ gee 

= i^E Tr ( M ^( M )) 



hee 

Tv (MV®(M)) . 



□ 



Remark 2.1. Note that the fixed point subspace, when suitably transformed, is a set of 
block diagonal matrices of the form (2) as a consequence of Schur's lemma. Projection of a 
matrix M onto the fixed point subspace in these coordinates corresponds precisely to zeroing 
components of M that are off the block diagonals, following by averaging the diagonal blocks 
corresponding to multiplicities of the same irreducible representation. 

2.2 Group theoretic parameters of interest 

Given a covariance matrix S which is invariant with respect to the action of a group 25, 
we introduce the following parameters that depend on S and (3. Our subsequent results 
characterizing the rates of convergence of the estimated covariance to the true covariance 
will be expressed in terms of these group theoretic parameters. 
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Given E, it will often be convenient to associate informally a weighted undirected com- 
plete graph to it, so that the entries of E correspond to edge weights. We will view the rows 
and columns of E as being indexed by a set V so that E G IRl^l*!^. The group (5 is then 
precisely the automorphism group of this weighted graph. Let G V x V (we will call 
this an "edge"). Define the edge orbit as 

0(ij) = {(g(i),g{j))\ge<d} 3 

We further define V(i,j) to be the set of nodes that appear in the edge orbit 0(i,j), i.e. 

V(i,j) = {k eV | 3/ eV,g G (3 such that (g(k),g(l)) G 0(i,j)}. 

We define dij (called the degree of the orbit 0(i,j)) to be the maximum number of times 
any node appears in the edge orbit 0(i,j), i.e. dij(k) = J2iev(ij) 1 (3/ G V | (k, I) G 0(i,j)), 
and d^ = max feeV (i,j) dij(k). 

Definition 2.1. Let M G be a matrix in the fixed point subspace. We define the orbit 
parameter O to be the size of the smallest orbit (with respect to the action of (5 on M) , i.e. 

O = min\0{i,j)\ 

The normalized orbit parameter Od is defined to be 

. \Q{i,j)\ 
O d = mm — . 

Example 2.2. For a covariance matrix of the form (3) shown in Example 2.1, 

0(l,2) = {(l,2),(2,3),...,(p-l,p),(p,l)} 
V(l,2) = {l,...,p} 

d\2 = 2 (since each node appears twice in (9(1,2)). 

For this example O = p and 0& = \. 

Example 2.3. For a covariance matrix of the form (4) with the action of the symmetric 
group of order p, it is easy to check that O = p (there are only two orbits, the first is 0(1, 1) 
which is of size p and degree 1, and the other is (9(1,2) of size v ^~ 1 ' and degree p — 1. Hence 
= p andO d =\. 

Example 2.4. For a covariance matrix of the from (5) with the action of the product of 
symmetric groups, if the sizes \Tj\ = for i = 1, . . . , k then it is clear that O = minj and 
O d = mini f ■ 

3 The pairs (g{i),g(j)) are to be viewed as unordered. Also, each edge is enumerated only once so that if 
(p(*))fl(j)) = (*>j) then we treat them as the same edge in 0(i,j). 
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In addition to these combinatorial parameters, we will also be interested in the following 
representation-theoretic parameters (which are also intrinsic to the group and the fixed point 
subspace W©). Given E and its block diagonalization (2) as described by Schur's lemma, we 
will be interested in the parameters Sj (the sizes of the blocks) and rrii (the multiplicities) 
for i G I. (We note that given a fixed point subspace Wg, multiplicities corresponding to 
many irreducible components may be zero, we call these the inactive components. We will 
only be interested in the active components. Thus the set of irreducible representations I 
mentioned henceforward will include only active representations). 

2.3 Symmetric Covariance Models 

Let X ~ A/"(0, X) be a normally distributed Revalued random vector where the covariance 
£ is unknown. Given a small number of samples . . . , X^ n \ we wish to learn properties 
of the model (for example, the covariance matrix E, the inverse covariance G := S -1 , the 
sparsity pattern of £ and 6). We will be interested in the high- dimensional setting where 
p is large and n < p, so that the number of samples is much smaller than the ambient 
dimension of the problem. 

Throughout the paper we will make the two following assumptions regarding our model: 

Assumption 2.1. 

1. We assume that £ is invariant with respect to a known group & so that £ E Wg,. 

2. We assume that the spectral norm of the true covariance matrix is bounded independent 
of the model size p, i.e. 



for some constant if;. 



Remark 2.2. A simple but important consequence of the <5-invariance o/£ (i.e. IT 5 £ITJ = £ 
for all g E <5) is that = Ti g u\ g (j\, and = Ylgu^g^) for all g E (3. 

Given samples . . . X^ we define the following. 



1. The sample Reynolds average of the k sample is: 



Eg) := V<s (X^(X^f) = j^Yl n ^ ( ^ W X' ( 6 ) 



gees 



2. The empirical covariance matrix is: 




(7) 



fc=i 
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3. The (5 -empirical covariance matrix is: 

S := V & (E n ) 



(8) 



Note that the entry of the (3-empirical covariance matrix may be written as: 

n 

± J_yy x ^\x% (9) 

1 1 k=l ge© 

where is the i th component of the sample X^ k \ We point out that (9) may be interpreted 
as averaging X - with respect to the orbit 0(i,j), followed by averaging over the samples 
k = 1, . . . , n. As we will argue subsequently, due to the double averaging, the estimate of the 
covariance matrix obtained via the (3-empirical covariance matrix is a much better estimator 
as compared to the empirical covariance matrix. Indeed, quantification of this fact is one of 
the central themes of this paper. 

Remark 2.3. Note that Lemma 2.2 ensures that if M >z 0, then V®(M) >z (since posi- 
tive definiteness is preserved under conjugation and addition). Hence the projection of the 
empirical covariance matrix P©(S n ) is a meaningful estimate of £ in the sense that it is a 
valid covariance matrix. 



2.3.1 Sufficient statistics 

We make a few brief remarks about the sufficient statistics for symmetric Gaussian models. 
For an arbitrary (zero-mean) model, it is well-known that the sufficient statistics correspond 
to quadratics associated to the variance, i.e. the sufficient statistics are 

4>ij(x) = XiXj for (i, j) G V x V. 

For ©-invariant models, the sufficient statistics are 



4> e {x) = XiXj for e = (i,j) G (V x V)\<3, 

where (V x V)\<3 denotes the quotient of V x V with respect to with respect to the 
equivalence relation (i, j) = {k, I) if g{i) = k and g(j) = I for some g G 0. Put more simply, 
the sufficient statistics correspond to quadratics (fiij, and we have only one statistic per orbit 
0(i,j) (as opposed to one per pair (i,j)). The representative sufficient statistic from orbit 
0(i,j) may be picked to be: 

\o(i,j)\ ^ XtXy 

An alternate set of sufficient statistics can be obtained via Schur's lemma by observing 
that Y = T*X is an equivalent statistical model whose covariance is block diagonal. The 
sufficient statistics are 6ij{y) = ViVj for i,j belonging to the same irreducible component. 
Note that one only needs one set of statistics per irreducible component, and they may be 
expressed as above by averaging the corresponding sufficient statistics with respect to the 
multiplicities of each irreducible component. 
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3 Rates of convergence 



In this section we will be interested in the statistical properties of the estimator E := "P,s(E n ). 
Note that given samples . . .X^ n \ E(E) = E. In this section, we will be interested in 
establishing rates of convergence of E in different norms, i.e. we will provide upper bounds 

on P ^(||E — E|| p > d^j , where || • \\ p are the Frobenius, spectral and norms. The rates 
presented here hold for the covariance estimation of any sub-Gaussian random vector X. 
Though we present them in the context of normally distributed random vectors, identical 
analysis holds for sub-Gaussians more generally. 



3.1 Spectral norm and Frobenius norm rates 

To obtain our rate results, we use the following standard result regarding spectral norm rates 
for an arbitrary normal model. 

Lemma 3.1. Let X ~ Af(0, E*) be a normal model with E* G IR pxp . Let ip = ||S*||. Given 
5 > with 5 < 8tp, let the number of samples satisfy n > 64 |f . Let E n be the empirical 
covariance matrix formed from the n samples. Then we have 

-n5 2 
128^ _ 

Proof. This is a standard result, a proof follows from [15, Theorem 11.13]. □ 

Theorem 3.1. Given an invariant covariance matrix E with ijj = ||E|| and a sample co- 
variance E n formed from n i.i.d. samples drawn from a Gaussian distribution according to 
covariance E, we have that 

n minj g j niiS 2 " 



|E" - E*|| > 5) < 2exp 



7><5(£ n ))|| > 6] < |X|exp 



128^ 2 



for n > maxjgj and for 5 < 8ip- 

Proof. Due to the unitary invariance of the spectral norm, we may assume that E is put in 
the suitable block-diagonal coordinates based on Schur's representation lemma. Letting X 
denote the indices of the set of active irreducible representations, using Lemma 3.1 we have 
for each i G X that 

~12# 

for 5 < 8ip and for nrrii > 64 ^ . Note that we get the extra factor of m; (the multiplicity 
of the representation indexed by i) since blocks corresponding to the same irreducible rep- 
resentation but different multiplicity are independent and identically distributed. Hence we 
average over the rrii blocks of size Sj (see Remark 2.1). Consequently we have by the union 
bound that 
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for n > maxjgx 



6A.Sjip 2 



□ 



Theorem 3.2. Given an invariant covariance matrix £ with ijj = ||£||2 and a sample co- 
variance S n formed from n i.i.d. samples drawn from a Gaussian distribution according to 
covariance S ; we have that 



for n > maxjgj 6 ^f 2 and for 5 < 8ip. 

Proof. The proof of this result follows directly from that of the previous theorem by noting 



As a direct corollary of the preceding theorems, we get the following sample complexity 
results. 

Corollary 3.1. Assume that ip is a constant (independent of p,n). The sample complexity 
of obtaining ||E — S|| < 5 for 5 < 8ip with probability exceeding 1 — for some constant c is 
given by: 



for some constant C\ . 

The sample complexity of obtaining ||S — E||i? < 5 for 5 < 8ip with probability exceeding 
1 — \ for some constant c is given by: 



for some constant C*2. 

Remark 3.1. It is instructive to compare the sample complexity described by Lemma 3.1 
versus the rates obtained from Theorem 3.1 for the case of the covariance matrices invariant 
with respect to the cyclic, the symmetric, and the product of symmetric groups. Note that the 
spectral norm sample complexity obtained from Lemma 3.1 is n > for some constant C . 
As explained in Example 2.1, for the cyclic group case = = 1 for alii so that the sample 
complexity is n > Cl $i p . For the symmetric group Si = l,m\ = 1 and s 2 = 1,^2 = p — 1 
so that the sample complexity becomes n > Cl $i p . For the case of product of k symmetric 
groups, the sample complexity is n > c ' max i^' lo gp} _ Hence in all three cases, projection onto 
the fixed point subspace delivers a dramatic improvement in the sample complexity. 




that || • \\f < y/p\\ ■ || for p x p matrices. 



□ 
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3.2 norm rates 

In this section we examine the ioo rates of convergence of the estimate 

E := V & (E") . 

In this section, as elsewhere in the paper, we will assume that our covariance model E has 
bounded spectral norm, i.e. ||E|| is bounded independent on p and n. It is possible to 
make the dependence on this quantity more explicit at the expense of more notation in a 
straightforward manner, for simplicity and clarity we omit doing so. 

Theorem 3.3. Let 0(i,j) be the orbit of the element let its degree be d^. Then there 

are constants Ci, C 2 such that for all i,j 

P (|E tJ - Eyl > t) < max jexp (-C in \0(i,j)\t 2 ) ,exp (-0 ^°^ Vj 

The proof is provided in the appendix. 



Recall that O := min^- \0(i,j)\ denotes the size of the smallest edge orbit, and Od : 

H '3 da ' 



miiiij ^ff^ , and let no denote the number of distinct (i.e. inequivalent) edge orbits 0(i,j) 



Note that no < p 2 trivially. 
Corollary 3.2. We have, 

P ^max |Ejj — Ejj| > t*J < no max {exp (-dnOt 2 ) ,exp (-C 2 nO d t)} . 

Proof. From Theorem 3.3, and by the definition of O and Od we have for each (i,j): 

P (ity - Eyl > tj < max {exp (-dnOt 2 ) ,exp (-C 2 nO d t)} . 

By taking union bound over the distinct orbits we obtain the required result. □ 
Corollary 3.3. We have the following sample complexity results: 



For |Ejj — Ejj| < 5 with probability exceeding 1 — ^ for some constant c, we need 

logp d i:j log p 



n > C3 max 
/or some constant C3. 
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Proof. This is a simple consequence of Theorem 3.3. 



□ 



Corollary 3.4. We have the following sample complexity results for the norm: 



max | Sy 



XL | < Op max 




logp logp 



) 



(11) 



To obtain an error bound maxjj — Sy| < 5 with probability exceeding 1 — \ for some 
constant c, we need 




for some constant C3. 



Proof. This is a simple consequence of Corollary 3.2. 



□ 



We remind the reader that the group-theoretic parameters O and Od were introduced in 
Section 2.2. These parameters roughly capture the sizes of the orbits in the group. Since 
samples corresponding to the same orbit are statistically equivalent (though the information 
is not i.i.d.), we may expect to reuse this information to get statistical gains. Corollary 3.4 
makes these gains explicit, indeed the sample complexity is smaller by a factor corresponding 
to the size of the orbit. 

Remark 3.2. It is again instructive to contrast the standard 1^ rates versus the improved 
rates derived in this section for symmetric models. The standard rates require a sample 
complexity of n > C1 ^ p for 5 = o(l). For the cyclic and symmetric group, our rates imply a 
sample complexity ofn> samples. (Similarly for the case of the product of symmetric 

groups our sample complexity is n > s2 1 ° sp .) We point out that in the case of models 
invariant with respect to the symmetric group, the overall rate is governed by the rate 
of convergence of the diagonal entries of £ corresponding to the orbit (9(1,1). We point 
out that the orbit 0(1, 2) is much larger, and the sample complexity for estimating S 12 is 



4 Sparse Model Selection 

4.1 Covariance Selection 

Estimation of covariance matrices in a high dimensional setting is an important problem 
in statistical analysis since it is an essential prerequisite for data analysis procedures such 
as dimension reduction via PCA, classification, and testing independence relations. In the 
high dimensional setting (which arise for example from the discretization of Markov random 
fields), the dimension of the problem may be larger than the number of samples available. 
In this setting it is well-known that the empirical covariance matrix is a poor estimator for 
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the covariance matrix [8]. There is a growing body of literature on understanding regular- 
ization procedures for covariance matrices that possess specific structure. For example, if 
the covariance matrix known to be sparse, Bickel and Levina [8] show that thresholding the 
empirical covariance matrix is both a natural and statistically meaningful means of regular- 
ization. Indeed they provide precise rates of convergence of the thresholded estimate to the 
true covariance matrix. 

In this section we analyze sparse covariance matrices arising from Gaussian models with 
symmetries. We show that if the covariance matrix is known to be invariant under the action 
of (3 and is sparse, a statistically meaningful way of regularizing the empirical covariance 
matrix is to first project it onto the fixed point subspace, followed by thresholding. Doing 
so provides a faster rate of convergence and reduces the sample complexity by a factor that 
depends on the group (more specifically on the size of the smallest orbit and the orbit degree). 

More precisely, let X ~ A/"(0, £), and let S be invariant under the action of the group (25. 
Furthermore, we adopt a model of sparsity on S that is somewhat more general (Bickel and 
Levina [8]) since it encompasses the usual notion of sparsity as well as power-law decay in 
the coefficients of S as follows. Define the set of approximately sparse covariance matrices 
via a uniformity class as: 

U{q) = js | < M, J2 W < c o(p), for all i J (12) 

for < q < 1. If g = we have a set of sparse matrices as follows: 

W(0) = |s | E« < M, ^ 0) < co(p), for all z j . (13) 

Here we think of Cq(p) as a parameter that controls the sparsity level of S, and is to be 
thought of as sublinear in p. Given samples X\, . . . , X n from this distribution, recall that S n 
is the empirical covariance matrix and the ©-empirical covariance matrix is X := V<& (£ n ). 

Recall that 1(A) denote the indicator function for the event A and define the thresholding 
operator (at threshold s) by 

%(M) = [M ij l(\M ij \>s)}, (14) 

so that the entry of M is retained if it is larger than s and is zero otherwise. We 

propose the following estimator for the covariance: 

Z est = T t (V< s (X n )) 

for a suitable choice of threshold parameter t to be specified in the subsequent theorem. As in 
[8] , we note that the thresholding operator preserves symmetry and under mild assumptions 
on the eigenvalues, preserves positive-definiteness. Indeed, the next theorem provides a rate 
of convergence of our thresholded estimate to the true covariance in spectral norm. As a 
consequence, if the smallest eigenvalue of S is bounded away from (independent of p and n), 
the theorem also establishes that with high probability the thresholded estimate is positive 
definite. 
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Theorem 4.1. Suppose S G U(q) satisfies Assumtion 2.1. Choose the threshold parameter 

t = M' max 



'logp logp 



nO ' nO, 



d 



for a constant M' . Then uniformly on U(q) 



\Tt{V e {E n ))-E\\ = P c (p)max 



'logp logp 



Remark 4.1. Note that the sample complexity gains observed in the norms translate 
directly to gains for sparse covariance selection. Indeed, in the absence of symmetry the 



1-9 



deviations 5 = Op ^Cq(p) (-^p) 2 J- In the case of models that are sparse and invariant 
with respect to the cyclic group, we see 5 = Op ( c (p) ( ) 2 ) . Moreover, for invariance 



with respect to the symmetric group, we have 5 = Op ^co(p) j ■ Hence we again 

observe sample complexity gains corresponding to the orbit parameters. 

Remark 4.2. We note that as in [9], sample complexity results can be established in the 
Frobenius norm, various other tail conditions, and for models with alternative forms of spar- 
sity [8] arising from banding, tapering, etc. We omit a detailed discussion of the same in 
this paper. 



4.2 Inverse Covariance Selection 

In the preceeding section we studied a means of regularizing structured covariance matrices 
which were sparse and symmetric with respect to (3. In many applications, it is much more 
suitable to study sparsity in the concentration (i.e. inverse covariance matrix). These are 
often best described using the terminology of graphical models [25]. 

Formally we assume that the inverse covariance of 0* := S -1 is sparse. We will associate 
a weighted undirected graph Q = (V, E) with p nodes and edge weights 0£- so that 9*^ ^ 
if and only if G E. In the context of Gaussian graphical models, the edges in the graph 
have a particularly appealing interpretation which makes them suitable in the modeling of 
Gauss-Markov random fields. Indeed, the edges represent conditional independence relations 
(i.e. Xi and Xj are conditionally independent of a set of nodes Xs for S C [p] if and only if 
S is a separator of % and j in the graph Q ) . 

A fundamental problem in graphical models learning is that of inferring the structure of 
the graph Q and the corresponding non-zero entries of 0* from measurements (see [31] and 
the references therein); this problem is called the graphical model selection problem. In the 
high-dimensional setup, one wishes to do so from a small number of measurements n -C p. 
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This has been shown [27, 31] to be possible provided the graph has additional structure, and 
is simple in the sense that it has low degree. 

Ravikumar et al. [31] propose a convex optimization based procedure to obtain consistent 
model selection of G* in the high-dimensional regime. Specifically, if S n is the empirical 
covariance matrix, they propose selecting O* as the solution to the following £ 1 regularized 
logdet problem: 

:= argmin Tr(£ n 6) - logdet(9) + fi n \\Q\\i off- (15) 

This approach has several appealing properties. Firstly the above optimization problem is 
convex, i.e. both the optimization objective and the constraint set are convex. Consequently, 
this formulation is amenable to efficient computation, since one can apply standard numerical 
procedures to efficiently solve for the global optimum of the above convex optimization 
problem. 

Secondly, the above formulation is statistically meaningful. Indeed if fi n = one recovers 
the standard maximum entropy formulation of the problem, and the t\ penalty on the off- 
diagonal terms in B can be viewed as a regularization term which encourages sparsity. In [31], 
the authors prove that the solution to problem (15) has good statistical properties (under 
reasonable statistical assumptions). The model selection can be shown to be consistent with 
good rates of convergence (see [31, Theorem 1, Theorem 2]) when the regularizer fi n is chosen 
judiciously. Importantly, these rates are governed by the degree of the graph Q and only the 
logarithm of the ambient dimension p. 

In this section, we will study in a similar setup the problem of Gaussian graphical model 
selection in which the graph Q has low degree and also a nontrivial automorphism group. 
Recall that E 6 implies that G* G <5, hence (3 is the automorphism group of the graph Q . 
We will assume that the group & is non-trivial and known, where as the graph Q is unknown. 
We examine how knowledge of the structure of (3 can be exploited for the model selection 
problem using i.i.d. samples . . . ,X < ™- ) drawn from Af(0, (G*) _1 ) to obtain statistical 
gains in terms of the sample complexity. 

We propose exploiting the known symmetry structure by solving the modified log- determinant 
program with subspace constraints: 

G:= argmin Tr(E n G) - logdet(G) + // n ||G||i i0ff . (16) 

The following lemma shows that solving the above problem is equivalent to symmetrizing 
the empirical covariance matrix directly and then solving the standard formulation. 

Lemma 4.1. Let S := V® (S n ). The optimization problem 

argmin Tr(SG) - logdet(Q) + /i„||6||i off (17) 

has the same solution as (16). 
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Proof. Let O be the optimal solution of (16). Let / s (0) = Tr(£™9) — log det(@) +/x ri || @|| i, ff- 
Then by definition / s (Oo) = is(n 9 OILj) for all g G <3. Since / s (0) is convex, by Jensen's 
inequality we have: 



Next, note that 



M^E n ^ n ;n </.(e„). 

V 1 gS© / 

However, ^ gg(5 Il 9 6oIlJ is feasible for the problem (16), hence 

M^E n s o n n =/-(©o). 

V 1 1 36© / 

Hence the problem (16) has the same optimal value as: 

argmin Tr(Ee) - logdet(6) + // n ||©||i,off- (18) 

(Both the optimization problems are strictly convex, and hence have the same unique solution 
[31, Appendix A]). 

Now we note that Tr(S6) = Tr(n 9 £nj0) = Tr(£n 9 6nJ) for all g G <3. Moreover 
logdet(-) is unitarily invariant. Lastly, note that U g QU^ for any permutation matrix Yl g has 
the effect of permuting the rows and columns of 0, which does not affect the || • ||i j0 ff norm. 
Hence the objective function of (17) is invariant with respect to (J5 and is convex. Since the 
constraint set of (17) (the positive definite cone <S++) is also invariant with respect to (5 
and convex, by a standard convexity argument it is possible to show that its solution must 
belong to the fixed point subspace. Thus the subspace constraint in (18) is redundant, and 
may be dropped. □ 

Before we proceed to state the statistical properties of the optimal solution to (16), 
we must make some standard statistical assumptions. We denote £ and B* to be the 
covariance and concentration matrix respectively for the true model. We denote by S the 
set of indices (including the diagonal entries) that are non-zero in 9*, and denote by 
S c the complement of this set. For any two index sets T, T , and define the matrix M TT i 
to be the submatrix of M with rows and columns indexed by T, T respectively. Finally we 
define V* := Cg) O*" 1 , where £g> denotes the Kronecker product. 

We define the following quantities: 

K S* = 1 1 21] 1 1 oo,oo K r* = || (r^S 1 ) ||oo,oo- (19) 

For simplicity, we will assume that that these quantities are constants in our asymptotic 
setting, i.e. they are bounded independent of p and n (although it is straightforward to 
make the dependence more explicit). Finally we make the following incoherence assumption 
which is also standard for such statistical models [31, Assumption 1]: 
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Assumption 4.1. There exists some a G (0, 1] such that 



max ||r* 5 (T* ss ) ||i < 1 — a. 

e£5 c 

We define the degree of the graphical model to be 
Note that this precisely corresponds to the degree of the graph Q. 

Theorem 4.2. Consider a symmetric Gaussian graphical model satisfying Assumption 2.1 
and Assumption 4-1- Let be the unique solution to the problem log- determinant program 
(16) with regularization parameter 



\ n = Mi max 



log p log p 



nO ' nO d 

for some constant M\. Suppose the sample size is bounded by n > max(ni,n.2), 

d 2 g \ogp dglogp 
ni = M 2 ^ L — — n 2 = M 3 —- 

for some constants M 2 and M 3 . Then with probability exceeding 1 — ^ for some constant 
t > we have: 

1. The estimate satisfies: 

'logp logp 



|0 - 0*IU < M max 



nO ' nO d 



for some constant Mq. 

2. It specifies an edge set E(Q) that is a subset of the true edge set E(Q*), and includes 
all edges with > M 4 max , \ for some constant M 4 . 

Proof. Is a direct consequence of Corollary 3.3 and [31, Theorem 1]. □ 

Remark 4.3. We note that the constants Mq, Mi, M 2 , M 3 , M4 depend on the parameters 
a, Ky,*, Kr*, t in a way that can be made explicit, as has been done in [31]. 

A more refined version of the second part of the preceding theorem that is analogous 
to [31, Theorem 2], and that establishes model selection consistency can also be established 
using our bounds, though we omit the statement of a precise result here. 
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Remark 4.4. Note that sample complexity gains in the norm rates are inherited for sparse 
inverse covariance selection. Indeed, the standard sample complexity requires n > Cdglogp 
samples to get an accuracy of 



IB - e Ik < uW— , 

n 



we have significant gains for models with symmetry. For the cyclic group, we need n > 

Cdl 1 ^ to obtain 
y p 



le-eiL <m 



'logp 



np 

For the symmetric group also we need n > Cdg 1 -^- to obtain 



ie-e*lL<M„ 



'logp 
np 



4.2.1 Extension to Latent Variable Graphical Model Selection 

The results presented in [31] have been extended in [13] to deal with the setting in which one 
may not observe samples of all the relevant variables. In contrast to graphical model selection 
in which one seeks a sparse approximation to a concentration matrix, [13] proposed a convex 
optimization method to approximate a concentration matrix as the sum of a sparse matrix 
and a low-rank matrix. The low-rank component captures the effect of marginalization 
over a few latent variables, and the sparse component specifies the conditional graphical 
model among the observed variables conditioned on the latent variables. The main result 
in [13] is that consistent model selection in the latent- variable setting (i.e., decomposition 
into sparse and low-rank components with the components having the same support /rank as 
the true underlying components) via tractable convex optimization is possible under suitable 
identifiability conditions even when the number of latent variables and the number of samples 
of the observed variables are on the same order as the number of observed variables. The 
sample complexity in these consistency results can be improved by exploiting symmetry 
in the underlying model in a similar manner to our presentation above. To consider just 
one simple example, if the marginal covariance matrix of the observed variables is invariant 
with respect to the cyclic group, then consistent latent-variable model selection using the 
convex programming estimator of [13] and by exploiting symmetry is possible (under suitable 
identifiability conditions) when the number of latent variables is on the same order as the 
number of observed variables, but with the number of samples of the observed variables 
growing only logarithmically in the number of observed variables. This improvement in the 
rates follows directly from our analysis in Section 3. 
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5 Computational Aspects 



In the preceding sections we have argued that knowledge of symmetry in a statistical model 
can lead to significant statistical gains in terms of sample complexity. To do so, one must 
exploit knowledge regarding symmetry judiciously by regularizing the model appropriately. 
We will now investigate the computational efficiency associated to symmetry regularization. 

5.1 Projection onto the fixed-point subspace 

A fundamental computational step in exploiting symmetry in statistical models is to project 
the empirical covariance matrix S n onto the fixed point subspace W©, i.e. computing 

As argued in Section 3, this estimator is a far more accurate estimator of the covariance as 
compared to the empirical covariance matrix. Moreover, in the selection of sparse covariance 
models with symmetry, our estimator is of the form: 

where t is an appropriate threshold level as described in Section 4.1. In the selection of 
sparse graphical models as described in Section 4.2, one needs to solve problem (17), which 
requires the computation of S = "Pg(E n ). 

Conceptually the projection can be computed in numerous ways. If the group is of small 
order, it is possible to compute the projection using Lemma 2.2. If the group has a small 
number of generators (for example the cyclic group) C5 gen Q <5, then the fixed point subspace 
can be expressed as 

W @ = {M | p(g)M = Mp{g) \/g G gcn } , 
and thus projections can be computed by solving the constrained least squares problem 

P<s(£ n )= argmin ||M-S n |||. 

M : p{g)M=Mp{g) V 9 6<S gcn 

One need only specify one matrix equality constraint corresponding to each generator of the 
group (the constraints corresponding to other group elements are implied by the generators) , 
so that for groups such as abelian groups this assumes an especially simple form. 

Lastly, if the diagonalizing matrix T corresponding to the irreducible representations 
from Lemma 2.1 are available, the projections are especially easy to compute. To compute 
7 :, We(S n ) one can simply compute $ := T*Y, n T. One can truncate the off-diagonal blocks of 
Q and average with respect to the multiplicities as described by Remark 2.1 to obtain 
Finally, one obtains V We (E n ) = T$'T*. 
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5.2 Solution of the l\ regularized logdet problem 

Given a Gaussian graphical model which is symmetric with respect to the group (3, we 
examined in Section 4.2 how one can reliably recover the model by solving the optimization 
problem (16) (or equivalently (17)). This optimization problem is convex and expressible as 
a log det conic program. An interesting consequence of symmetry reduction techniques that 
are now well- understood in optimization [21, 36] is that by exploiting the group invariance of 
the objective function and the constraint set one can reduce the size of (17). Let T := T*QT, 
and S := T*ET. Then (17) reduces to: 

mm Tr (ST) - logdet (r) + ^ n ||TTT*||i )0 ff 
subject to: r y 0. 

In the resulting conic program in the transformed domain, the optimization variable T is 
block diagonal, so that there is a reduction in the problem size. Furthermore, many iterative 
algorithms require linear algebraic operations (such as matrix inversion). These steps can 
be sped up for block diagonal matrices, and the complexity is governed by the size of the 
largest irreducible component as given by Schur's lemma. 

In a high dimensional setup where the problem size p is large, it is often prohibitively 
expensive to implement an interior point method to solve the problem due to the compu- 
tational burden of the Newton step [11]. Moreover, the above formulation is inefficient due 
to the introduction of additional variables SV,. For these reasons, it is far more desirable to 
implement an accelerated first order method that uses the special structure of the objective 
function. One such method described by d'Aspremont et al. [14, pp. 4-5], provides an 
efficient iterative algorithm with low memory requirements. A careful examination of their 
algorithm reveals that the iterations are ©-invariant, i.e. each iterate of the algorithm lies on 
the fixed-point subspace. Moreover, the computational bottlenecks in the implementation of 
the algorithm correspond to inversion and eigenvalue decompositions of matrices that belong 
to the fixed point subspace. Here again, the unitary transform T and the associated block- 
diagonal structure is particularly useful and will provide significant computational speedups. 

6 Experimental Results 

In this section we report results obtained from a set of simulations that illustrate the efficacy 
of accounting for symmetry via the methods proposed in this paper. 

In the first simulation experiment we generate a covariance matrix E of size p = 50 that is 
circulant and hence invariant with respect to the cyclic group. The first row of the circulant 
matrix is randomly generated (and then modified to ensure the matrix is positive definite). 
Figure 1 illustrates the rates of convergence that are observed in different matrix norms as 
the number of samples are increased. (For each sampling level, 20 trials are conducted and 
the average error is reported). It is apparent that the estimate formed by projection of E n 
onto the fixed-point subspace far outperforms the estimate formed by E n . 
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Number of Samples Number of Samples Number of Samples 

Figure 1: A comparison of the rates of convergence for a covariance matrix that is invariant with 
respect to the cyclic group. The ©-empirical covariance matrix is a better estimate as compared 
to the empirical covariance matrix. 



In the next simulation, we generate a sparse circulant covariance matrix £ of size p = 50 
(so that it is again invariant with respect to the cyclic group). The entries as well as the 
sparsity pattern of the first row of the circulant matrix are random. The sparsity pattern 
is such that twenty percent of the matrix has non-zero entries. Figure 2 compares the 
performance obtained by different means of regularization. The first is symmetry-oblivious 
and uses the methodology proposed by Bickel and Levina [8], that thresholds the matrix at 

an appropriate level (that is proportional to v/^p)- The second methodology is symmetry 
aware, and projects the empirical covariance matrix to the fixed-point subspace, followed by 
thresholding (proportional to \J^^ a s proposed in Section 4.1). We compare the empirically 

observed error in different norms. 

In the last experiment, we consider the problem of recovering a sparse inverse covariance 
matrx (i.e. a graphical model) of size p = 50. The graph corresponding to the graphical 
model is a cycle. The entries 0*, = 1 for all i. The edge weights of the cycle are all set to 
Q*j = —0.25, the remaining entries of O* are zero. We perform an experiment pertaining to 
model selection consistency, i.e. recovering the structure of the graph from a small number 
of measurements. We compare two methodologies. The first is symmetry-oblivious, and uses 
the l\ regularized logdet formulation proposed in [31] using only the empirical covariance 
matrix S n in the objective function. The second uses the formulation (17) proposed in 
Section 4.2, i.e. we first project S n onto the fixed-point subspace corresponding to the 
cyclic group, and then solve the l\ regularized logdet problem. The empirically observed 
probabilities of correct model selection using these two methods are shown in Fig. 3. This 
plot is generated by varing the number of samples from n = 5 to n = 1505 at increments 
of 100 samples. For each sample level, we perfom 50 experiments and report the average 
probability of success. The regularization parameter for the symmetry oblivious method 
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Figure 2: A comparison of the rates of convergence for a sparse covariance matrix that is invariant 
with respect to the cyclic group. Projection followed by thesholding provides a better estimate. 



is chosen proportional to y -^p, as proposed in [31], and for the symmetry aware method 

it is chosen to be proportional to \J^j^--, as proposed in Section 4.2. Both the symmetry 

oblivious, as well as the symmetry aware formulations required the solution of t\ regularized 
logdet problems. These were solved using the solver COVSEL [14]. 

As seen in these experiments, for models containting symmetry the symmetry aware 
methodology proposed in this paper significantly outperforms symmetry oblivious methods. 



7 Conclusion 

In this paper we have examined Gaussian models with symmetries in the high-dimensional 
regime. We have suggested a natural means of estimating a (J5-invariant model from a 
small number of samples. The key idea is the formation of the projection of the empirical 
covariance matrix onto the fixed point subspace, which enjoys good statistical properties. As 
one of the main results of this paper, we have made precise the statistical gains enjoyed by 
this estimate by establishing explicit sample complexity bounds for the Frobenius, spectral, 
and norms. These sample complexity bounds depend naturally on certain group theoretic 
parameters that characterize the extent to which the group theoretic structure allows sample 
reuse. We have further used our sample complexity bounds to analyze sparse covariance 
models, and sparse graphical models that enjoy symmetry. Finally we provide experimental 
evidence that verifies our results via computer simulations. 
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Figure 3: A comparison of the model selection consistency obtained by 7 ? ®(E n ) based logdet 
formulation versus a S n based log det formulation. 
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A rates 

Before we present the formal proof, we remind the reader of some basic machinery related 
to pre-Gaussian random variables. 

Definition A.l. A random variable Y G M is called pre-Gaussian [12] with parameters r, A 
if 

Eexp(AF) < exp ( \t 2 \ 2 



for all A G (-A, A). 

A useful fact about the pre-Gaussian variables (see [12, Chapter 1]) is that if Y is pre- 
Gaussian with parameters r, A, the random variable cY is pre-Gaussian with parameters 
\c\t, A. The following lemma characterizes the tail behavior of pre-Gaussian random vari- 
ables. 

Lemma A.l. Suppose Y is a pre-Gaussian random variable with parameters r, A. Then 

F(\Y\ > t) < { 2exp (~^) if ° < * < At " 
\ 2exp(-f) ifAr 2 <t. 

Proof. See [12, Chapter 1]. □ 

The canonical examples of pre-Gaussian random variables are \ 2 distributed random 
variables. Indeed, let D G W xp be a diagonal positive definite matrix and let X ~ A/"(0, D) 
be a M p -valued Gaussian random variable. Then the random variable X T X — Tr(£)) is 
pre-Gaussian (see [12, Lemma 5.1, Example 3.1]) with parameters 

r - Tr(D) A - - — . (20) 

Lemma A. 2. Let FeR"~ A/"(0, S). Let Q = Q T y and let A max (-) denote the maximum 
eigenvalue. Then 

Eexp (X(Y T QY - Tr(QE))) < exp (^X 2 Tt(Q^ VA G [—A, A] , 
where A = . 1 T — r - 

2A max ((S2j QS2 
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Proof. For Y ~ jV(0, S) let us define X = Q§y. Then X ~ Af(0,Q^EQi) and r T QF - 
Tr(gS) = X T X — Tr(QE). Consider = U T X where U is unitary and has the property that 
U T Q^EQhu = D and D is diagonal. Applying (20) to R T R - Tr(QE) = Y T QY - Tr(QE), 
and usine; the fact that T^QsEQs) = Tr(QE), and A max (Q5EQ5) = A max (E5QE2), we have 
the desired result. □ 

In the subsequent analysis, it will be more convenient to deal with normalized random 
variables, which we define as: 



X 



(k) 

g(i) 



Note that E X 



,(k) 



1. We also define: 



/v. .v..' 



and note that by symmetry of the covariance matrix = E g (jY ff (j) and hence pjj = PgU), g (i)- 
Define 



(fc) 



We observe that 



(ff(0.sO'))GO(ij) 



^»(0 A 5(J)' 



(21) 



y^-\o(i,j)\ Pij 



4 v y v 



(fe) 



where, 



V'; 



E (^ ) + ^g ) ) 2 -2|0(z,i)|(l + p ii ) 
(g(i),9(i))eO(v) 

E (^)-^)) 2 -2|0(M-)i(l-^). 



(22) 



(9(i),9(i))eO(i,i) 



For the covariance matrix of the normalized random variables occuring in the orbit of 
define = ||Ev(ij)||, the spectral norm restricted to the variables occuring in the 
orbit. Note that er^ < ||E|| and is hence bounded by a constant. Also recall that dij is the 
degree of the orbit 0(i,j). 

Lemma A. 3. The following inequalities hold: 

Eexp(XU^) < exp (\<D(i, j) | A 2 (l + Pij )) for all A e 



Eexp(A^ fc) ) < exp {\0{i, 3 )\\ 2 {l - Pij )) for all A G 
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Proof. Fix a particular sample k, a pair and consider the corresponding 0(i,j) and 

V(i,j). Let Z e ]Rl v ( J J)l be the vector of all the distinct nodes that occur on the edge orbit 
0(i,j). Let W + be the vector of all the terms X^X + X^X, and W~ be the vector of all 

the terms X^X — X^X. Then there are matrices A + and A~ (node-edge incidence matrices) 
such that 

W + = A + Z W~ = A~Z. 

Note that U\f = (W + ) T W + - 2\0(i,j) |(1 +fHj) and V<f } = {W-) T W- - 2\0(i, - Pij ). 
Let Hz be the submatrix of the covariance matrix associated to Z. Then we note that we 
can write 

(W + ) T W + = Z T (A + ) T A + Z 

where Z G R\ v ^ ~ Af(0, E z ). Note that by definition of A + and Z, Tr ((A+) T = 

20(i,j)(l + pij) (similar equality holds for A~). To be able to invoke Lemma A. 2 we need 
to bound two quantities: 

A max ((S|) T (A + ) T A + Sl) and A max ((e|) t \A~f ' A'T^j . 

Note that 

A max ((Sl) T (A + fA + Sl) < || {A+f A+\\\\H Z \\ 



By the defninition of A + , we have 
|| {A + ) T A + \\ = 



<<7tf|| (A + ) T A + ||. 

^(g(0,g(j))eo(»j)( g; ( t ) + x (i)) 2 
2 ^{g(i),g(j))£0(i,j) ( x ( 1 ) 2 + X UY 



max 



< max 



^2kev(i,j) x l 



2d tJ E 



2 



fcev(ij) x fc 



< max 2 

Z^fcev(«,i) x k 

< 2d ir 



Hence A max ^(£f ) T (v4 + ) T v4 + £^ — 2dij0~ij. (Similar inequalities hold for A ). 

Now using Lemma A. 2, the result follows. □ 



Lemma A.4. ^ = i^V.^. 
Proof. 



9 ee 



A E 



n ijXg[i)X g (j), 
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where is the number of times edge (and hence all other edges in the orbit) are 

repeated in the Reynolds average. Now note that the size of the orbit divides the size of 
the group, and in fact jQTjjji = n ij (hi fact both these are equal to the cardinality of the 

stabilizer of the edge Dividing throughout by ^Y,uY,jj, we get the desired result. □ 

Let 



n _, n 



tj. . - t V U {k) V- ■ - - V V {k) Y ■ - - V Y {k) (23) 

u « 4n\0(i,j)\ V » 4n\0(i,j)\ ^ V * *" 4n\0(i,j)\ ^ ' {M} 



Lemma A. 5. 



- Ey| > v' N -->-,;/) < P (\Ua\ > +P (|Vy| > ^ 



Proof. We have 

Sjj — Ejj _ 1 \ ^ y(k) _ 

^tiioblK'-^') (^(22)) 
= ^ - ^ (by (23)). 

Now 

- s^-i > ^>:„y: u i} = {\u tJ - v tJ \ >t}c jity > tj u j|vy > 

from which the conclusion follows. □ 

Proof of Theorem 3.3. Note that P (|E/ y | > f) < 2P (C/^ > |) . Now C7 ij = ^^-^ ££ =1 U^f , 
so that 

(1 n 

From Lemma A. 3, 

Eexp(A^ fc) ) < exp (\0(i, j)|A 2 (l + py)) for all A G 
By rescaling, we have 



Eexp ( — JL—AE/^ < exp f -&±^- A 2 ) for all A G 
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&ij d^j &ij 



Since the random variables U-^ are independent and identically distributed (they correspond 
to different samples) for k — 1, . . . ,n, we have that 



Eexp (XUij) < exp ( A 2 ) for all A G 

4 16|C?(«,j)|n ' 



n\(D(i,j)\ n\0(i,j)\ 



Thus Uij is pre-Gaussian with parameters r 2 = ^g^^ju an d A = ■ From Lemma A.l 

we have 

2t 2 n\Q(i,j)\ \ 



f\ 2 exp , forO<t< 



/ ^ zexp [ „ j , ioi t ^ lfiA „ 

Another way of rewriting the above is: 



l+py / ' — — 16dij(Tij 

16di,Uj 



t , , rr , A f ^ 2t 2 n|C(i,j)|\ / n|(9(i,j)|t 

P ( |ty > - J < max <( 2 exp ( \ v ) , 2 exp ' 1 



2/ l_ \ 1 "I - Pij J \ ^dij&ij 

By a similar argument, 

™A T , , t\ f / 2t 2 n|C(i,j)|\ / n\0(i,j)\t 

P MVJjl > - < max ^ 2exp / \ ,JM ) , 2exp ' 1 



1 Pij J V ^dij(Tij 

Let Ci = vsxaxij j 1+ 2 p , j^ 2 ^- j, and C2 = minjj j^ 2 -, j- (Note that all these quantities 

are constants since the spectral norm ||E|| is bounded by a constant by assumption). Using 
Lemma A. 5, we get the following. 

P (\±ij - E -| > y/E^tj < max jexp (-C x n\0{i,j)\t 2 ) ,exp ^_ g H<?M)l^ 

Noting that a/E^E^ is bounded by a constant, we replace t by -, = , and alter the 
constants in the exponent suitably to obtain the required result. □ 

B Proof of Theorem 4.1 

We define the following function that captures the norm sample complexity 



It will be convenient to work with the notation E = V<& (E n ). 

Proof. The proof is almost identical to the proof of [8, Theorem 1]. We present it below for 
the sake of completeness. Note that from Corollary 3.2, 

max |Ej~ — EjJ = Op (S(n,p)) . 
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Now we note that 

||7J(x:)-e||<||7;(e)-e|| + ||7;(i:) -7I(E)||, 

and we will focus on bounding the different terms. 

Recall that for a symmetric matrix M, \\M\\ < maXj = i v .. )P X^=i l-^y'l- (This is because 
ll-^ll < (||-^"||i,i||-^"||oo,oo) 5 = ||M||i,i)- As a consequence, the first term is bounded by 

p 

max l S u I I <t)< t^coip). 

3=1 

The second term can be bounded as 

v v 
\\T t {£) - T t {Z)\\ <maxj]|iy 1(1X^-1 >t,\Eij\ < t) +max^\E ij \m£ ij \ < t, |E -| > t) 

3=1 1 3=1 

P 

+ max^ |Ejj - E^lldEyl > t, |Ejj| > i) 

3=1 

= 1 + 11 + III. 



Now note that 



For term II we note that 



III < max |Ey — Ejj| max |Ejj| 9 t q = Op (c (p)t 9 <5(n,p)) 

3=1 



II < max Jl(\£ij - + |E^|)l(|Ey| < t, |E^-| > i) 



p p 
< max | E jo — EjJ > l(|Ej,-| > t) + f max ) l(|Ej,-| > t) 

U 1 u 

= P (c (p)t-' 1 5(n,p) + c (p)t 1 -' 1 ). 

To bound I, we pick 7 e (0, 1) and note that 

p p 

I < max |Ejj — Ejj |l(|Ey| >t,|Eij| < +max^|Ey|l(|E ij | < t) 

j=i % 3=1 

p p 

< max I Ejj Ejj |l(|Ey| > t, < 7t) + max^ | Ejj Ejj 

j=l % 3=1 

+ t 1 - q c (p) 

< max - E^l [ V Iflsi,- - E^| > (1 - 7 )i) + c (p)( 7 t)" <? ) + i 1_ %(p) 
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< max |Ejj — T, i j\c (p)('yt) q + t 1 q c (p) (with high probability) 
= Op (co(p)( 7 0"^(n,p) + t^coip)) . 
The second to last inequality follows from the fact that: 

P ^maxj] - S„| > (1 - 7)*) > j = P ^max |Ey - > (1 - 7)^ 

1 

< — 

pC 

for t as chosen in the statement of the theorem. Combining all these inequalities we get the 
required result. □ 



38 



